In this Python project, I carried out an exploratory data analysis (EDA) using pandas, a Python library. Here is a step-by-step walkthrough of the project.
I started by importing the Python libraries used in the project: pandas, seaborn and matplotlib.pyplot. Pandas is a powerful data manipulation and analysis library for Python. Seaborn is a statistical data visualization library built on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics. Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python, and pyplot is its module that provides a MATLAB-like interface for plotting.
The next step involved defining the path where the data was stored, importing the data and storing it in a data frame.
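As a rough sketch, the import and load steps could look like the following. The column names and values are placeholders chosen for illustration, and a small in-memory CSV stands in for the real file so that the snippet runs on its own.

```python
import io

import pandas as pd

# The project also imports the plotting libraries at this point:
# import seaborn as sns
# import matplotlib.pyplot as plt

# Hypothetical stand-in for the real CSV file. The columns and values
# are placeholders, not the project's data.
csv_text = """Country,Continent,2022 Population
A,Asia,1000
B,Africa,500
C,Europe,250
"""

# With a file on disk this would be pd.read_csv(path), where path points
# to wherever the data is stored.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```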
I then standardized how numbers in the data frame are displayed, so that all of them are rounded off and shown to two decimal places.
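One common way to get this effect in pandas is the `display.float_format` option, sketched below. This is an assumption about how the formatting could be done, not necessarily the exact call used in the project, and the sample values are placeholders.

```python
import pandas as pd

# Show floats with two decimal places. This changes the display only;
# the stored values keep their full precision.
pd.set_option("display.float_format", lambda x: f"{x:.2f}")

df = pd.DataFrame({"Growth Rate": [1.00456, 0.98765]})  # placeholder values
shown = df.to_string()
print(shown)
```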
I then went ahead to get some quick, basic information about the data. The data has a total of 234 rows and 17 columns, and the output also indicates the number of missing values in each column. The data types present in the data are objects, integers and floats.
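A minimal sketch of this step, assuming it is done with `shape` and `info()`, is shown here on a tiny made-up data frame with one text, one integer and one float column.

```python
import pandas as pd

# Placeholder data, not the project's dataset.
df = pd.DataFrame({
    "Country": ["A", "B", "C"],          # text column
    "Rank": [1, 2, 3],                   # integer column
    "Growth Rate": [1.01, None, 0.99],   # float column with one missing value
})

print(df.shape)  # (number of rows, number of columns)
df.info()        # data type and non-null count for each column
```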
After finding out the basic information about the data, I produced some quick summary statistics. This is a good first step for drawing quick insights from the data.
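Summary statistics of this kind are typically produced with `describe()`. The sketch below uses a single placeholder column to show what it returns.

```python
import pandas as pd

# Placeholder values, not real population figures.
df = pd.DataFrame({"2022 Population": [1000, 500, 250, 250]})

# count, mean, std, min, quartiles and max for each numeric column
summary = df.describe()
print(summary)
```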
I then checked how many missing values there are in the data. This is useful for data cleaning: first to know whether there are any, and second to decide what to do about them.
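Counting missing values per column can be sketched as follows, assuming the usual `isnull().sum()` idiom. The data frame is made up for illustration.

```python
import pandas as pd

# Placeholder data with a few deliberately missing entries.
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Area": [10.0, None, 30.0],
    "Growth Rate": [None, None, 1.01],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing = df.isnull().sum()
print(missing)
```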
After checking for null values, I also wanted to know the number of unique values in each column and, together with the null-value findings, whether any data is duplicated. In data like this, repeated values should be rare, since each country is unique in almost every metric. So if there are repeated values, we need to find out why and perhaps correct them.
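A short sketch of the unique-value check is given below using `nunique()`. The `duplicated()` call is an extra, complementary check added here for illustration and is not taken from the project. The data is made up.

```python
import pandas as pd

# Placeholder data, not the project's dataset.
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Continent": ["Asia", "Asia", "Europe"],
})

# Number of distinct values in each column
unique_counts = df.nunique()
print(unique_counts)

# Number of rows that are exact copies of an earlier row
duplicate_rows = df.duplicated().sum()
print(duplicate_rows)
```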
I then looked for the five countries with the highest populations in the world as of 2022. The top three, as you might guess, were China, India and the United States. More interestingly, Indonesia and Pakistan came in fourth and fifth respectively.
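One way to get such a ranking is to sort by the population column and keep the first five rows, as sketched below. The column name `2022 Population` is an assumption, and the countries and values are placeholders.

```python
import pandas as pd

# Placeholder countries and values, not real population figures.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F"],
    "2022 Population": [300, 900, 100, 700, 500, 200],
})

# Sort from largest to smallest and keep the first five rows.
top5 = df.sort_values(by="2022 Population", ascending=False).head(5)
print(top5)
```

`df.nlargest(5, "2022 Population")` would give the same rows and is a reasonable alternative.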
I then computed the correlation between the numeric columns. The diagonal values are all 1 because each column is perfectly correlated with itself.
Population columns are highly correlated with each other (values close to 1), indicating that countries with large populations in one year tend to have large populations in other years. This is expected as population sizes tend to change gradually over time. Area (km²) has a moderate positive correlation with population columns (around 0.45-0.53). Larger countries by area tend to have larger populations. Density (per km²) has very low correlation values with most columns, indicating that population density is not strongly related to the absolute population numbers or the area. Growth Rate has low correlation with other columns, suggesting that population growth rate is somewhat independent of the current population size, rank, and area. World Population Percentage is highly correlated with the population columns (values close to 1), as expected since it is directly derived from population numbers.
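A correlation matrix like the one discussed above can be sketched as follows. The column names and values are placeholders, and selecting the numeric columns first is an assumption about how text columns such as the country name are handled.

```python
import pandas as pd

# Placeholder data, not the project's dataset.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "2022 Population": [100, 200, 300, 400],
    "2020 Population": [90, 190, 280, 390],
    "Area": [40, 10, 30, 20],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()
print(corr)
```

If a visual version is wanted, the matrix can be passed to seaborn, for example `sns.heatmap(corr, annot=True)`.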
I then moved the analysis to the continent level. This involved summarizing and comparing the averages of the numeric attributes across continents, which gives insight into how population characteristics (size, density, growth rate, etc.) vary between continents based on the available data. This analysis can be helpful in understanding regional trends and patterns in population dynamics. I then focused on the 2022 figures, which are the latest in the population data.
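The continent-level averages can be sketched with a `groupby` followed by `mean`. The column names are assumptions and the values are placeholders. Sorting by the 2022 column is one possible way to focus on the latest year.

```python
import pandas as pd

# Placeholder data, not the project's dataset.
df = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Europe", "Europe"],
    "2022 Population": [100, 300, 50, 150],
    "Growth Rate": [1.02, 1.00, 0.99, 1.01],
})

# Average of every numeric column within each continent
by_continent = df.groupby("Continent").mean(numeric_only=True)

# Order the continents by their mean 2022 population, largest first.
ranked = by_continent.sort_values(by="2022 Population", ascending=False)
print(ranked)
```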
Out of curiosity, I delved into the Oceania continent to find out which countries make up Oceania and how their population and size are distributed.
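Filtering down to a single continent can be done with a boolean mask, as in this sketch. The column names are assumptions, and the countries and values are placeholders.

```python
import pandas as pd

# Placeholder data, not the project's dataset.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Continent": ["Oceania", "Asia", "Oceania", "Europe"],
    "2022 Population": [20, 900, 5, 60],
    "Area": [7000, 9000, 300, 500],
})

# Keep only the rows whose continent is Oceania.
oceania = df[df["Continent"] == "Oceania"]

# Show population and size side by side, largest population first.
oceania = oceania[["Country", "2022 Population", "Area"]].sort_values(
    by="2022 Population", ascending=False
)
print(oceania)
```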
I then found the mean population of each continent over the years. I transposed the result and stored it in a new dataframe to be used later for plotting. Before plotting, I checked the column names in the data frame for reference and to make sure the graph is accurate. I then plotted the population distribution as a line graph.
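The steps above can be sketched as follows. The year columns and values are placeholders, and the non-interactive `Agg` backend is used only so that the snippet runs without a display; in a notebook that line can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import pandas as pd

# Placeholder data, not the project's dataset.
df = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Europe", "Europe"],
    "2000 Population": [80, 120, 40, 60],
    "2010 Population": [90, 150, 45, 65],
    "2022 Population": [100, 200, 50, 70],
})

year_cols = ["2000 Population", "2010 Population", "2022 Population"]

# Mean population per continent for each year column
means = df.groupby("Continent")[year_cols].mean()

# Transpose so the years become rows and each continent becomes a column.
trend = means.transpose()
print(trend.columns)  # check the column names before plotting

# One line per continent across the year columns
ax = trend.plot(kind="line")
ax.set_xlabel("Year")
ax.set_ylabel("Mean population")
```

Transposing first means pandas draws one line per continent with the years along the x-axis, which is usually the shape wanted for a trend plot.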
To download and view the full project on GitHub, click here.